Abstract

We study how correlations in the design matrix influence Lasso prediction. 
First, we argue that the higher the correlations are, the smaller the optimal 
tuning parameter is. This implies in particular that the standard tuning param- 
eters, that do not depend on the design matrix, are not favorable. Furthermore, 
we argue that Lasso prediction works well for any degree of correlations if suit- 
able tuning parameters are chosen. We study these two subjects theoretically 
as well as with simulations. 

Keywords: Correlations, Lars Algorithm, Lasso, Restricted Eigenvalue, Tun- 
ing Parameter. 



1 Introduction 



Although the Lasso estimator is very popular and correlations are present in many 
of its diverse applications, the influence of these correlations is still not entirely un- 
derstood. Correlations are surely problematic for parameter estimation and variable 
selection. The influence of correlations on prediction, however, is far less clear.

Let us first set the framework for our study. We consider the linear regression 
model 

$$Y = X\beta_0 + \sigma\epsilon, \tag{1}$$

where $Y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, $\epsilon \in \mathbb{R}^n$ is the noise and $\sigma \in \mathbb{R}^+$ is the noise level. We assume in the following that the noise level $\sigma$ is known and that the noise $\epsilon$ obeys an $n$ dimensional normal distribution with covariance matrix equal to the identity. Moreover, we assume that the design matrix $X$ is normalized, that is, $(X^T X)_{jj} = n$ for $1 \le j \le p$. Three main tasks are then usually considered: estimating $\beta_0$ (parameter estimation), selecting the non-zero components of $\beta_0$ (variable selection), and estimating $X\beta_0$ (prediction). Many applications of the above regression model are high dimensional, that is, the number of variables $p$ is larger than the number of observations $n$, but are also sparse, that is, the true solution $\beta_0$ has only few nonzero entries. A computationally feasible method for the mentioned tasks is, for instance, the widely used Lasso estimator introduced in [Tib96]:

$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.$$

In this paper, we focus on the prediction error of this estimator for different degrees of correlations. The literature on the Lasso estimator has become very large; we refer the reader to the well written books [BC11, BvdG11, HTF01] and the references therein.

Two types of bounds for the prediction error are known in the theory for the Lasso estimator. On the one hand, there are the so called fast rate bounds (see [BRT09, BTW07b, vdGB09] and references therein). These bounds are nearly optimal but require restricted eigenvalue or similar conditions and therefore only apply to weakly correlated designs. On the other hand, there are the so called slow rate bounds (see [HCB08, KTL11, MM11, RT11]). These bounds are valid for any degree of correlations but - as their name suggests - are usually thought of as unfavorable.

Regarding the mentioned bounds, one could claim that correlations lead in general to large prediction errors. However, recent results in [vdGL12] suggest that this is not true. It is argued in [vdGL12] that for (very) highly correlated designs, small tuning parameters can be chosen and favorable slow rate bounds are obtained. In the present paper, we provide more insight into the relation between Lasso prediction and correlations. We find that the larger the correlations are, the smaller the optimal tuning parameter is. Moreover, we find both in theory and simulations that Lasso performs well for any degree of correlations if the tuning parameter is chosen suitably.

We finally give a short outline of this paper: we first discuss the known bounds 
on the Lasso prediction error. Then, after some illustrating numerical results, we 
study the subject theoretically. We then present several simulations, interpret our 
results and finally close with a discussion. 



2 Known Bounds for Lasso Prediction 

To set the context of our contribution, we first discuss briefly the known bounds for the prediction error of the Lasso estimator. We refer to the books [BC11] and [BvdG11] for a detailed introduction to the theory of the Lasso.

Fast rate bounds, on the one hand, are bounds proportional to the square of the tuning parameter $\lambda$. These bounds are only valid for weakly correlated design matrices. We first recall the corresponding assumption. Let $a$ be a vector in $\mathbb{R}^p$, $J$ a subset of $\{1, \dots, p\}$, and finally $a_J$ the vector in $\mathbb{R}^p$ that has the same coordinates as $a$ on $J$ and zero coordinates on the complement $J^c$. Denote the cardinality of a given set by $|\cdot|$. For a given integer $s$, the Restricted Eigenvalues (RE) assumption introduced in [BRT09] then reads

Assumption RE(s):

$$\phi(s) := \min_{J_0 \subset \{1,\dots,p\}:\, |J_0| \le s}\ \ \min_{\Delta \neq 0:\, \|\Delta_{J_0^c}\|_1 \le 3\|\Delta_{J_0}\|_1} \frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_{J_0}\|_2} > 0.$$



The integer $s$ plays the role of a sparsity index and is usually comparable to the number of nonzero entries of $\beta_0$. More precisely, to obtain the following fast rates, it is assumed that $s \ge \bar{s}$, where $\bar{s} := |\{j : (\beta_0)_j \neq 0\}|$. Also, we notice that $\phi(s) \approx 0$ corresponds to correlations. Under the above assumption it holds (see for example Bickel et al. [BRT09] and more recently Koltchinskii et al. [KTL11]):

$$\|X(\hat\beta - \beta_0)\|_2^2 \le \frac{\lambda^2 \bar{s}}{n\,\phi^2(s)} \tag{2}$$

on the set $T := \left\{ \sup_{\beta \in \mathbb{R}^p} \frac{2\sigma|\epsilon^T X\beta|}{\|\beta\|_1} \le \lambda \right\}$. Similar bounds, under slightly different assumptions, can be found in [vdGB09]. Usually, the tuning parameter $\lambda$ is chosen proportional to $\sigma\sqrt{n \log(p)}$. For fixed $\phi$, the above rate then is optimal up to a logarithmic term (see [BTW07a, Theorem 5.1]) and the set $T$ has a high probability (see Section 3.2.2). For correlated designs, however, this choice of the tuning parameter is not suitable. This is detailed in the following section.

Slow rate bounds, on the other hand, are bounds only proportional to the tuning parameter $\lambda$. These bounds are valid for arbitrary designs; in particular, they are valid for highly correlated designs. The result [KTL11, Eq. (2.3) in Theorem 1] yields in our setting

$$\|X(\hat\beta - \beta_0)\|_2^2 \le 2\lambda\|\beta_0\|_1 \tag{3}$$

on the set $T$. Similar bounds can be found in [HCB08] (for a related work on a truncated version of the Lasso), in [MM11, Theorem 3.1] (for estimation of a general function in a Banach space), and in [RT11, Theorem 4.1] (which also applies to the non-parametric setting; note that the corresponding bound can be written in the form above with arbitrary tuning parameter $\lambda$). We note that these bounds depend on $\|\beta_0\|_1$ instead of $\bar{s}$. Moreover, they depend on $\lambda$ to the first power, and these bounds are therefore considered unfavorable compared to the fast rate bounds.

The mentioned bounds are only useful for sufficiently large tuning parameters such that the set $T$ has a high probability. This is crucial for the following. We show that the higher the correlations are, the larger the probability of $T$ is. Correlations thus allow for small tuning parameters; for correlated designs, this implies, via the factor $\lambda$ in the slow rate bounds, favorable bounds even though no fast rate bounds are available.
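The role of the set $T$ can be made concrete numerically. Since the supremum of $|\epsilon^T X\beta| / \|\beta\|_1$ over $\beta$ is attained at a coordinate vector, $T$ holds exactly when $\lambda \ge 2\sigma\|X^T\epsilon\|_\infty$. The following sketch estimates a high quantile of this threshold by Monte Carlo for equicorrelated Gaussian designs. It is only an illustration under stated assumptions: the function names, the seed, the number of repetitions and the parameter values are hypothetical and not taken from the study.

```python
import numpy as np

def sample_design(n, p, rho, rng):
    """Rows ~ N(0, Sigma) with Sigma_jj = 1 and Sigma_jk = rho (j != k);
    columns are rescaled so that (X^T X)_jj = n."""
    common = rng.standard_normal((n, 1))      # component shared by all columns
    own = rng.standard_normal((n, p))         # column-specific component
    X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * own
    return X * np.sqrt(n) / np.linalg.norm(X, axis=0)

def smallest_admissible_lambda(X, eps, sigma=1.0):
    """sup_beta 2*sigma*|eps^T X beta| / ||beta||_1 = 2*sigma*||X^T eps||_inf."""
    return 2.0 * sigma * np.max(np.abs(X.T @ eps))

def threshold_quantile(n, p, rho, level=0.95, reps=2000, seed=0):
    """Monte Carlo quantile of the smallest lambda for which T holds
    (one fixed design, fresh noise in every repetition)."""
    rng = np.random.default_rng(seed)
    X = sample_design(n, p, rho, rng)
    draws = [smallest_admissible_lambda(X, rng.standard_normal(n))
             for _ in range(reps)]
    return float(np.quantile(draws, level))
```

Calling `threshold_quantile(20, 40, rho)` for increasing `rho` should show the required tuning parameter shrinking as the columns become more correlated, in line with the claim above.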

Remark 2.1. The slow rate bound (3) can be improved if $\|\beta_0\|_1$ is large. Indeed, we prove in the Appendix that

$$\|X(\hat\beta - \beta_0)\|_2^2 \le 2\lambda \min\left\{ \|\beta_0\|_1,\ \|(\hat\beta - \beta_0)_{J_0}\|_1 \right\}$$

on the set $T$, where $J_0 := \{j : (\beta_0)_j \neq 0\}$ denotes the sparsity pattern of $\beta_0$. That is, the prediction error can be bounded both with $\|\beta_0\|_1$ and with the $\ell_1$ estimation error restricted to the sparsity pattern of $\beta_0$. The latter term can be considerably smaller than $\|\beta_0\|_1$ (in particular for weakly correlated designs). A detailed analysis of this observation, however, is not within the scope of the present paper.






3 The Lasso and Correlations 



We show in this section that correlations strongly influence the optimal tuning pa- 
rameters. Moreover, we show that - for suitably chosen tuning parameters - Lasso 
performs well in prediction for different levels of correlations. For this, we first present 
simulations where we compare Lasso prediction for an initial design with Lasso pre- 
diction for an expanded design with additional impertinent variables. Then, we 
discuss the theoretical aspects of correlations. We introduce, in particular, a simple and illustrative notion of correlations. Further simulations finally confirm our analysis.

3.1 The Lasso on Expanded Design Matrices 

Does Lasso prediction become worse when many impertinent variables are added to the design? Regarding the bounds and the usual value of the tuning parameter $\lambda$ described in the last section, one may expect that many additional variables lead to notably larger optimal tuning parameters and prediction errors. However, as we see in the following, this is not true in general.

Let us first describe the experiments. 

Algorithm 1 We simulate from the linear regression model (1) and take as input the number of observations $n$, the number of variables $p$, the noise level $\sigma$, the number of nonzero entries of the true solution $s := |\{j : (\beta_0)_j \neq 0\}|$ and finally a correlation factor $\rho \in [0, 1)$. We then sample the $n$ independent rows of the design matrix $X$ from a normal distribution with mean zero and covariance matrix with diagonal entries equal to 1 and off-diagonal entries equal to $\rho$, and we normalize $X$ such that $(X^T X)_{jj} = n$ for $1 \le j \le p$. Then, we define $(\beta_0)_i := 1$ for $1 \le i \le s$ and $(\beta_0)_i := 0$ otherwise, sample the error $\epsilon$ from a standard normal distribution and compute the response vector $Y$ according to (1). After calculating the Lasso solution $\hat\beta$, we finally compute the prediction error $\|X(\hat\beta - \beta_0)\|_2^2$ for different tuning parameters $\lambda$ and find the optimal tuning parameter, that is, the tuning parameter that leads to the smallest prediction error.
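One run of Algorithm 1 might be sketched as follows. This is an illustration under assumptions, not the code used for the experiments: the Lasso is solved here by plain coordinate descent as a simple stand-in for a dedicated solver such as LARS, the tuning parameter is searched over a finite grid, and all function names, the number of sweeps and the tolerance are hypothetical.

```python
import numpy as np

def sample_design(n, p, rho, rng):
    """Equicorrelated Gaussian rows, columns rescaled so that (X^T X)_jj = n."""
    common = rng.standard_normal((n, 1))
    own = rng.standard_normal((n, p))
    X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * own
    return X * np.sqrt(n) / np.linalg.norm(X, axis=0)

def lasso_cd(X, Y, lam, beta_init=None, n_sweeps=300, tol=1e-8):
    """Coordinate descent for min_b ||Y - X b||_2^2 + lam * ||b||_1
    (assumption: used here instead of LARS)."""
    p = X.shape[1]
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    col_sq = (X ** 2).sum(axis=0)
    resid = Y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # correlation of column j with the partial residual
            z = X[:, j] @ resid + col_sq[j] * beta[j]
            # soft-thresholding at lam/2 (the loss carries no factor 1/2)
            new = np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / col_sq[j]
            if new != beta[j]:
                resid += (beta[j] - new) * X[:, j]
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta

def algorithm1_once(n, p, s, sigma, rho, lambdas, rng):
    """Return (optimal lambda, minimal prediction error, all errors) for one draw."""
    X = sample_design(n, p, rho, rng)
    beta0 = np.zeros(p)
    beta0[:s] = 1.0
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    errors = np.empty(len(lambdas))
    beta = None
    for k, lam in enumerate(lambdas):      # warm start along the grid
        beta = lasso_cd(X, Y, lam, beta_init=beta)
        errors[k] = np.sum((X @ (beta - beta0)) ** 2)
    best = int(np.argmin(errors))
    return lambdas[best], errors[best], errors
```

Averaging the returned errors over many independent calls, for instance with a decreasing grid such as `np.linspace(10, 0.5, 20)`, gives curves of the kind shown in the figures of this section.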

Algorithm 2 This algorithm only differs from the above algorithm in one point. In an additional step after the initial design matrix $X$ is sampled, we add for each column $X^{(j)}$ of the initial design matrix $p - 1$ columns sampled according to $X^{(j)} + \eta N$. We finally normalize the resulting matrix. The parameter $\eta$ controls the correlation among the added columns and the initial columns, and $N$ is a standard normally distributed random vector. Compared to the initial design, we now have a design with $p^2 - p$ additional impertinent variables.
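The expansion step of Algorithm 2 can be sketched as below. The sketch assumes that an independent noise vector $N$ is drawn for every added column; the function name `expand_design` is hypothetical.

```python
import numpy as np

def expand_design(X, eta, rng):
    """Append, for each column X^(j), p - 1 noisy copies X^(j) + eta * N and
    renormalize all columns to squared norm n.

    Assumption: N is drawn independently for every added column.
    Returns an n x p^2 matrix whose first p columns are the initial design.
    """
    n, p = X.shape
    blocks = [X]
    for j in range(p):
        noise = rng.standard_normal((n, p - 1))
        blocks.append(X[:, [j]] + eta * noise)   # broadcast column j over p - 1 copies
    X_big = np.hstack(blocks)
    return X_big * np.sqrt(n) / np.linalg.norm(X_big, axis=0)
```

For small `eta` the added columns are almost collinear with their parent column, whereas larger `eta` makes them behave more like additional, weakly correlated variables.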



Several algorithms for computing a Lasso solution have been proposed: for example, using interior point methods [CDS98], using homotopy parameters [EHJT04, OPT00, Tur05], or using a so-called shooting algorithm [DDDM04, Fu98, FHHT07]. We use the LARS algorithm introduced in [EHJT04] since, among others, Bach et al. [BJMO11, Section 1.7.1] have confirmed the good behavior of this algorithm when the variables are correlated.

Parameters: $n = 20$, $p = 40$, $s = 4$, $\sigma = 1$, $\rho = 0$, $\eta = 0.001$

Figure 1: We plot the mean values $\overline{PE}$ of the prediction errors $\|A\hat\beta - A\beta_0\|_2^2$ for 1000 iterations as a function of the tuning parameter $\lambda$. The blue, dashed line corresponds to Algorithm 1, where $A$ stands for the initial design matrices. The blue, dotted lines give the confidence bounds. The red, solid line corresponds to Algorithm 2, where $A$ represents the extended matrices. The faint, red lines give the confidence bounds. The parameters for the algorithms are given in the header. The mean of the optimal tuning parameters is $3.62 \pm 0.01$ for Algorithm 1 and $3.63 \pm 0.01$ for Algorithm 2. These values are represented by the blue and red vertical lines.

Parameters: $n = 20$, $p = 40$, $s = 4$, $\sigma = 1$, $\rho = 0$, $\eta = 0.1$

Figure 2: We plot the mean values $\overline{PE}$ of the prediction errors $\|A\hat\beta - A\beta_0\|_2^2$ for 1000 iterations as a function of the tuning parameter $\lambda$. The blue, dashed line corresponds to Algorithm 1, where $A$ stands for the initial design matrices. The blue, dotted lines give the confidence bounds. The red, solid line corresponds to Algorithm 2, where $A$ stands for the extended matrices. The faint, red lines give the confidence bounds. The parameters for the algorithms are given in the header. The mean of the optimal tuning parameters is $3.63 \pm 0.01$ for Algorithm 1 and $4.77 \pm 0.01$ for Algorithm 2. These values are represented by the blue and red vertical lines.

Results We did 1000 iterations of the above algorithms for different $\lambda$ and with $n = 20$, $p = 40$, $s = 4$, $\sigma = 1$, $\rho = 0$ and with $\eta = 0.001$ (Figure 1) and $\eta = 0.1$ (Figure 2). We plot the means $\overline{PE}$ of the prediction errors as a function of $\lambda$. The blue, dashed curves correspond to the initial designs (Algorithm 1), the red, solid curves correspond to the extended designs (Algorithm 2). The confidence bounds are plotted with faint lines in the corresponding color, and finally the mean values of the optimal tuning parameters are plotted with vertical lines.

We find in both examples that the minimal prediction errors, that is, the minima of the red and blue curves, do not differ significantly. Additionally, in the first example, corresponding to highly correlated added variables ($\eta = 0.001$, see Figure 1), the optimal tuning parameters do not differ significantly either. However, in the second example ($\eta = 0.1$, see Figure 2), the optimal tuning parameter is considerably larger for the extended designs.

First Conclusions Our results indicate that tuning parameters proportional to $\sqrt{n \log p}$ (cf. [BRT09] and most other contributions on the subject), independent of the degree of correlations, are not favorable. Indeed, for Algorithm 2, this would lead to a tuning parameter proportional to $\sqrt{n \log p^2} = \sqrt{2n \log p}$, whereas for Algorithm 1 it would lead to $\sqrt{n \log p}$. But regarding Figure 1, the two optimal tuning parameters are nearly equal and hence these choices are not favorable. In contrast, the results illustrate that the optimal tuning parameters depend strongly on the level of correlations: for Algorithm 2 (red, solid curves), the mean of the optimal tuning parameters corresponding to the highly correlated case ($3.63 \pm 0.01$, see Figure 1) is considerably smaller than the one corresponding to the weakly correlated case ($4.77 \pm 0.01$, see Figure 2).
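The comparison in this paragraph amounts to a short calculation, sketched here with the reported means of the optimal tuning parameters. The helper name `universal_scale` is a hypothetical label for the design-independent scale $\sqrt{n \log p}$.

```python
import math

def universal_scale(n, p):
    """sqrt(n log p): the design-independent scale of the standard tuning parameter."""
    return math.sqrt(n * math.log(p))

n, p = 20, 40

# Going from p to p^2 variables, the standard choice grows by sqrt(2).
ratio_predicted = universal_scale(n, p * p) / universal_scale(n, p)

# Ratios of the reported mean optimal tuning parameters (Algorithm 2 / Algorithm 1).
ratio_high_corr = 3.63 / 3.62   # eta = 0.001, Figure 1
ratio_low_corr = 4.77 / 3.63    # eta = 0.1, Figure 2

print(f"predicted by sqrt(n log p): {ratio_predicted:.3f}")
print(f"observed, eta = 0.001:      {ratio_high_corr:.3f}")
print(f"observed, eta = 0.1:        {ratio_low_corr:.3f}")
```

The observed ratio stays close to 1 for the highly correlated extension and remains below $\sqrt{2}$ even for the weakly correlated one.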

Our results indicate additionally that the minimal mean prediction errors are comparable for all cases. This implies that suitable tuning parameters lead to good prediction even with additional impertinent parameters. We only give two examples here but made these observations for any values of $n$, $p$ and $s$.

3.2 Theoretical Evidence 

We provide in this section theoretical explanations for the above observations. For this, we first discuss results derived in [vdGL12]. They find that high correlations allow for small tuning parameters and that this can lead to bounds for Lasso prediction that are even more favorable than the fast rate bounds. Then, we introduce and apply new correlation measures that provide some insight for (in contrast to [vdGL12]) arbitrary degrees of correlations. For no correlations, in particular, these results simplify to the classical results.

3.2.1 Highly Correlated Designs 

First results for the highly correlated case are derived in [vdGL12]. Crucial in their study is the treatment of the stochastic term with metric entropy.

The bound on the prediction error for Lasso reads as follows:
Lemma 3.1. [vdGL12, Theorem 4.1 & Corollary 4.2] On the set

J 2a\fXP\ - r) 

r ° : ~ IT WWWi - J 

2 a — 1 

we have for A = (2An Q " 1 )^ \\f3 \\l +a and < a < 1 

PX9-A0II2< y {^n-- 1 )^ Wo\\P ■ 

We show in the following that for high correlations the stochastic term $T_\alpha$ has a high probability even for small $\alpha$ and thus favorable bounds are obtained. The parameter $\alpha$ can be thought of as a measure of correlations: a small $\alpha$ corresponds to high correlations, a large $\alpha$ corresponds to small correlations.



The stochastic term $T_\alpha$ is estimated using metric entropy. We recall that the covering numbers $N(\delta, \mathcal{F}, d)$ measure the complexity of a set $\mathcal{F}$ with respect to a metric $d$ and a radius $\delta$. Precisely, $N(\delta, \mathcal{F}, d)$ is the minimal number of balls of radius $\delta$ with respect to the metric $d$ needed to cover $\mathcal{F}$. The entropy numbers are then defined as $H(\delta, \mathcal{F}, d) := \log N(\delta, \mathcal{F}, d)$. In this framework, we say that the design is highly correlated if the covering numbers $N(\delta, \mathrm{sconv}\{X^{(1)}, \dots, X^{(p)}\}, \|\cdot\|_2)$ (or the corresponding entropy numbers) increase only mildly with $1/\delta$, where $\{X^{(1)}, \dots, X^{(p)}\}$ are the columns of the design matrix and sconv denotes the symmetric convex hull. This is specified in the following lemma:

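Lemma 3.2. [vdGL12, Corollary 5.2] Let $0 < \alpha < 1$ be fixed. Then, assuming

$$\log\left(1 + N\left(\sqrt{n}\,\delta, \mathrm{sconv}\{X^{(1)}, \dots, X^{(p)}\}, \|\cdot\|_2\right)\right) \le \left(\frac{A}{\delta}\right)^{2\alpha}, \quad 0 < \delta \le 1, \tag{4}$$

there exists a value $C(\alpha, A)$ depending on $\alpha$ and $A$ only such that for all $\kappa > 0$ and for

$$\lambda = \sigma C(\alpha, A)\sqrt{n^{2-\alpha}\log(2/\kappa)},$$

the following bound is valid:

$$P(T_\alpha) \ge 1 - \kappa.$$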

We observe indeed that the smaller $\alpha$ is, the higher the correlations are. We also mention that Assumption (4) only applies to highly correlated designs. An example is given in [vdGL12]: the assumption is met if the eigenvalues of the Gram matrix decrease sufficiently fast.

Lemma 3.1 and Lemma 3.2 can now be combined to the following bound for the prediction error of the Lasso:

Theorem 3.1. With the choice of $\lambda$ as in Lemma 3.1 and under the assumptions of Lemma 3.2 it holds that



\X0 - AOHij < y (aC(a, A)y/n°]Dg{2/ K j) ^ \\fo\\f 



l + a 



2 

with probability at least $1 - \kappa$.

High correlations allow for small values of $\alpha$ and lead therefore to favorable bounds. For $\alpha$ sufficiently small, these bounds may even outmatch the classical fast rate bounds. For moderate or weak correlations, however, Assumption (4) is not met and therefore the above result does not apply.

3.2.2 Arbitrary Designs



In this section, we introduce bounds that apply to any degree of correlations. The correlations are in particular allowed to be moderate or small, and the bounds simplify to the classical results for weakly correlated designs. We first introduce two measures for correlations that are then used to bound the stochastic term. These results are then combined with the classical slow rate bound to obtain a new bound for Lasso prediction. We finally give some simple examples.

Two Measures for Correlations We introduce two numbers that measure the correlations in the design. For this, we first define the correlation function $K : \mathbb{R}_0^+ \to \mathbb{N}$ as

$$K(x) := \min\left\{ l \in \mathbb{N} : \exists\, x^{(1)}, \dots, x^{(l)} \in \sqrt{n}S^{n-1},\ X^{(m)} \in (1 + x)\,\mathrm{sconv}\{x^{(1)}, \dots, x^{(l)}\}\ \ \forall\, 1 \le m \le p \right\}, \tag{5}$$

where $S^{n-1}$ denotes the unit sphere in $\mathbb{R}^n$. We observe that $K(x) \le p$ for all $x \in \mathbb{R}_0^+$ and that $K$ is a decreasing function of $x$. A measure for correlations should, as the metric entropy above, measure how close to one another the columns of the design matrix are. Indeed, for moderate $x$, $K(x) \approx p$ for uncorrelated designs, whereas $K(x)$ may be considerably smaller for correlated designs. This information is concentrated in the correlation factors that we define as

$$K_\kappa := \inf_{x \in \mathbb{R}_0^+} (1 + x)\sqrt{\frac{\log(2K(x)/\kappa)}{\log(2p/\kappa)}}, \quad \kappa \in (0, 1],$$

and as

$$F := \inf_{x \in \mathbb{R}_0^+} (1 + x)\sqrt{\frac{\log(1 + K(x))}{\log(1 + p)}}.$$

Since $K(0) \le p$, it holds that $F, K_\kappa \in (0, 1]$. We also note that similar quantities could be defined for $p = \infty$ (removing the normalization). In any case, large $F$ and $K_\kappa$ correspond to uncorrelated designs, whereas small $F$ and $K_\kappa$ correspond to correlated designs.

Control of the Stochastic Term We now show that small correlation factors allow for small tuning parameters and thus lead to favorable bounds for Lasso prediction. Crucial in our analysis is again the treatment of a stochastic term similar to the one above.



We prove the following bound in the appendix:

Theorem 3.2. With the definitions above, $T$ as defined in Section 2 and for all $\kappa > 0$ and $\lambda \ge \lambda_\kappa := K_\kappa\, 2\sigma\sqrt{2n\log(2p/\kappa)}$, it holds that

$$P(T) \ge 1 - \kappa.$$

Additionally, independently of the choice of $\lambda$,

$$\mathbb{E}\left[ \sup_{\|\beta\|_1 \le M} \sigma|\epsilon^T X\beta| \right] \le F\sigma\sqrt{8n\log(1 + p)}\, M.$$

For small correlation factors $K_\kappa$ and $F$, the minimal tuning parameters $\lambda_\kappa$ and the expectation of the stochastic term are small. For $K_\kappa \to 1$, the minimal tuning parameters simplify to $2\sigma\sqrt{2n\log(2p/\kappa)}$ (cf. [BRT09]). Similarly, for $F \to 1$, the expectation of the stochastic term simplifies to $\sigma\sqrt{8n\log(1 + p)}\, M$.

Together with the slow rate bound introduced in Section 2, this permits the following bound:

Corollary 3.1. For $\lambda \ge \lambda_\kappa$ it holds that

$$\|X(\hat\beta - \beta_0)\|_2^2 \le 2\lambda\|\beta_0\|_1$$

with probability at least $1 - \kappa$.
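To see how a correlation factor enters this bound, the following sketch evaluates the refined tuning parameter $\lambda_\kappa = K_\kappa\, 2\sigma\sqrt{2n\log(2p/\kappa)}$ and the resulting slow rate bound $2\lambda\|\beta_0\|_1$. The function names are hypothetical, and the value of $K_\kappa$ is treated as a given input, since computing it for a concrete design is not addressed here.

```python
import math

def classical_lambda(sigma, n, p, kappa):
    """Classical tuning parameter 2*sigma*sqrt(2 n log(2p/kappa)) (case K_kappa = 1)."""
    return 2.0 * sigma * math.sqrt(2.0 * n * math.log(2.0 * p / kappa))

def refined_lambda(K_kappa, sigma, n, p, kappa):
    """Refined tuning parameter lambda_kappa = K_kappa * classical_lambda."""
    if not 0.0 < K_kappa <= 1.0:
        raise ValueError("the correlation factor K_kappa lies in (0, 1]")
    return K_kappa * classical_lambda(sigma, n, p, kappa)

def slow_rate_bound(lam, beta0_l1_norm):
    """Right-hand side 2 * lambda * ||beta_0||_1 of the slow rate bound."""
    return 2.0 * lam * beta0_l1_norm
```

For example, with $n = 20$, $p = 40$, $\sigma = 1$, $\kappa = 0.05$ and $\|\beta_0\|_1 = 4$, a hypothetical correlation factor of $0.5$ halves both the admissible tuning parameter and the bound compared with the classical choice.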

Our contribution to this result concerns the tuning parameters: for the classical value $\lambda = 2\sigma\sqrt{2n\log(2p/\kappa)}$, the bound simplifies to the classical slow rate bounds. Correlations, however, allow for smaller $\lambda$ and thus lead to more favorable bounds.

Let us have a look at some general aspects of the above bound. Most importantly, Corollary 3.1 applies to any degree of correlations. In contrast, Theorem 3.1 only applies to highly correlated designs, and the classical fast rate bounds only apply to weakly correlated designs. We also observe that the sparsity index $s$, which appears in the classical fast rate bounds, does not appear in Corollary 3.1. Hence, Corollary 3.1 can be useful even if the true regression vector $\beta_0$ is only approximately sparse. However, for very large $\|\beta_0\|_1$, the bound is unfavorable. An example in this context can be found in [CP09, Section 2.2]. (However, we do not agree with their conclusions corresponding to this example. We stress, in contrast, that correlations are not problematic in general.) We finally refer to Remark 2.1 and to [MM11] for some additional considerations on the optimality of this kind of bounds.
Remark 3.1 (Weakly Correlated Designs). Corollary 3.1 holds for any degree of correlations because it follows directly from the classical slow rate bound and the properties of the refined tuning parameter. However, for weakly correlated designs, we can also make use of the refined tuning parameter to improve the classical fast rate bound by a factor $K_\kappa^2$. Usually, the gain is moderate, but in special cases such as sparse designs (see Example 3.2), the gain can be large. Finally, we note that optimal rates, that is, lower bounds for the prediction error $\|X(\hat\beta - \beta_0)\|_2^2$, are available for weakly correlated designs. The optimal rate is then $s\log(1 + p/s)$ and is deduced from Fano's Lemma (see [BTW07a, Theorem 5.1, Lemma A.1]). For similar results in matrix regression, we refer to [KTL11], where the assumptions on the design needed for such lower bounds are explicitly stated (see, in particular, their Assumption 2).

Examples In this final section, we illustrate some properties of the correlation 
numbers K K and F for various settings. We consider, in particular, design matrices 
with different geometric properties and with different degrees of correlations. 

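Example 3.1 (Low Dimensional Design). Let $\dim \mathrm{span}\{X^{(1)}, \dots, X^{(p)}\} \le W$. Then,

$$X^{(j)} \in \sqrt{W}\,\mathrm{sconv}\{x_1, \dots, x_W\} \quad \text{for all } 1 \le j \le p$$

for properly chosen $x_1, \dots, x_W \in \sqrt{n}S^{n-1}$ (for example orthogonal vectors in a suitable subspace). Hence, $K_\kappa \le (1 + \sqrt{W})\sqrt{\frac{\log(2W/\kappa)}{\log(2p/\kappa)}}$ and $F \le (1 + \sqrt{W})\sqrt{\frac{\log(1 + W)}{\log(1 + p)}}$.

Example 3.2 (Sparse Design). Let the number of non-zero coefficients in $X^{(j)}$ be such that $\|X^{(j)}\|_0 \le d$ for all $1 \le j \le p$, with $d \le n$. Then,

$$X^{(j)} \in \sqrt{d}\,\mathrm{sconv}\{x_1, \dots, x_n\} \quad \text{for all } 1 \le j \le p$$

for properly chosen $x_1, \dots, x_n \in \sqrt{n}S^{n-1}$ (for example orthogonal vectors in $\mathbb{R}^n$). Hence, $K_\kappa \le \sqrt{d}\sqrt{\frac{\log(2n/\kappa)}{\log(2p/\kappa)}}$ and $F \le \sqrt{d}\sqrt{\frac{\log(1 + n)}{\log(1 + p)}}$.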



Example 3.3 (Weakly Correlated Design). We consider a weakly correlated design with $p \gg n$. It turns out that the results in Section 3.2.1 lead to a tuning parameter $\lambda \sim n$. This implies in particular that the results in Section 3.2.1, unlike the results in this section, do not simplify to the classical results for uncorrelated designs. We give a sketch of the proof in the Appendix.

Example 3.4 (Highly Correlated Design). We illustrate with an example that highly correlated designs indeed involve small correlation factors $K_\kappa$. This implies, in accordance with the simulation results, that tuning parameters depending on the design rather than tuning parameters depending only on $\sigma$, $n$, and $p$ should be applied.
Let us describe how the design matrix $X$ is chosen: Fix an arbitrary $n$ dimensional vector $X^{(1)}$ of Euclidean norm $\sqrt{n}$. Then, we add $p - 1$ columns sampled according to $X^{(1)} + \nu N$, where $\nu$ is a positive (but small) constant and $N$ is a standard normally distributed random vector. We finally normalize the resulting matrix.
For simplicity, we do not give explicit constants but present a coarse asymptotic result that highlights the main ideas. For this, we consider the number of variables $p$ as a function of the number of observations $n$ and impose the usual restrictions in high dimensional regression, that is, $p \sim n^w$ for a $w \in \mathbb{N}$. If the correlations are sufficiently large (that is, if $\nu$ is sufficiently small), it holds that

$$K_\kappa \to \frac{1}{\sqrt{w}} \quad \text{as } n \to \infty$$

with probability tending to 1. This implies especially that the standard tuning parameters $\sim \sigma\sqrt{n\log p}$ can be considerably too large if $w$ is large. The proof of the above statement is given in the Appendix.



Example 3.5 (Equal Columns). Let the cardinality of the set $\{X^{(j)} : 1 \le j \le p\}$ be $v$. Then, $K_\kappa \le \sqrt{\frac{\log(2v/\kappa)}{\log(2p/\kappa)}}$ and $F \le \sqrt{\frac{\log(1 + v)}{\log(1 + p)}}$.
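The bounds in the examples above all have the same shape: if every column lies in $c \cdot \mathrm{sconv}$ of $K$ vectors on the sphere $\sqrt{n}S^{n-1}$, then $K_\kappa \le c\sqrt{\log(2K/\kappa)/\log(2p/\kappa)}$ and $F \le c\sqrt{\log(1 + K)/\log(1 + p)}$. The sketch below evaluates these expressions; the function name and the numerical inputs are hypothetical, and the results are capped at 1 because both factors never exceed 1.

```python
import math

def correlation_factor_bounds(scale, K, p, kappa):
    """Upper bounds on (K_kappa, F) from one admissible representation.

    scale : constant c such that every column lies in c * sconv{x_1, ..., x_K}
    K     : number of vectors x_1, ..., x_K on the sphere sqrt(n) S^{n-1}
    p     : number of columns of the design
    kappa : level in (0, 1]
    """
    K_kappa = scale * math.sqrt(math.log(2.0 * K / kappa) / math.log(2.0 * p / kappa))
    F = scale * math.sqrt(math.log(1.0 + K) / math.log(1.0 + p))
    # Both correlation factors are at most 1, so the bounds can be capped.
    return min(K_kappa, 1.0), min(F, 1.0)

# Equal columns: v = 10 distinct columns among p = 1000 (scale 1, K = v).
equal_columns = correlation_factor_bounds(1.0, 10, 1000, 0.05)

# Sparse design: at most d = 2 non-zero entries, n = 50, p = 10**6 (scale sqrt(d), K = n).
sparse_design = correlation_factor_bounds(math.sqrt(2.0), 50, 10**6, 0.05)
```

In both hypothetical settings the bounds fall below 1, so the refined tuning parameter is smaller than the classical one.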

3.3 Experimental Study 

We consider Algorithm 1 with different sets of parameters to make statements about the influence of the single parameters on Lasso prediction. In particular, we are interested in the influence of the correlations $\rho$.



Results We collect in Table 1 the means of the optimal tuning parameters $\lambda_{\min}$ and the means of the minimal prediction errors $PE_{\min}$ for 1000 iterations and different parameter sets. Let us first highlight the two most important observations: first, correlations ($\rho$ large) lead to small tuning parameters. Second, correlations do not necessarily lead to high prediction errors. On the contrary, the prediction errors are mostly smaller for the correlated settings.

Table 1: The means of the optimal tuning parameters $\lambda_{\min}$ and the means of the minimal prediction errors $PE_{\min}$ calculated according to Algorithm 1 with 1000 iterations and for different sets of parameters.

Let us now make some other observations. First, we find that the optimal tuning parameters do not increase considerably when the number of observations $n$ is increased. In contrast, the means of the minimal prediction errors decrease for the correlated case as expected, whereas this is not true for the uncorrelated case.

Second, increasing the number of variables $p$ leads to increasing optimal tuning parameters as expected (interestingly, by factors close to the square root of the ratio of the corresponding logarithmic terms, cf. Corollary 3.1). The means of the minimal prediction errors do, surprisingly, not increase considerably.

Third, as expected, increasing the sparsity $s$ does not considerably influence the optimal tuning parameters but leads to increasing means of the minimal prediction errors.

Fourth, for $\sigma = 3$ both the optimal tuning parameter and the mean of the minimal prediction error increase approximately by a factor 3. The mean of the minimal prediction errors for $\sigma = 3$ and $\rho = 0$ is an exception and remains unclear.

We finally mention that we obtained analogous results for many other values of $\beta_0$ and sets of parameters.

Conclusions The experiments illustrate the good performance of the Lasso estimator for prediction even for highly correlated designs. Crucial is the choice of the tuning parameters: we found that the optimal tuning parameters depend highly on the design. This implies in particular that choosing $\lambda$ proportional to $\sqrt{n \log p}$ independent of the design is not favorable.

4 Discussion 

Our study suggests that correlations in the design matrix are not problematic for Lasso prediction. However, the tuning parameter has to be chosen suitably for the correlations. Both the theoretical results and the simulations strongly indicate that the larger the correlations are, the smaller the optimal tuning parameter is. This implies in particular that the tuning parameter should not be chosen only as a function of the number of observations, the number of parameters and the variance. The precise dependence of the optimal tuning parameter on the correlations is not known, but we expect that cross validation provides a suitable choice in many applications.

Appendix

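Proof of Remark 2.1. By the definition of the Lasso estimator, we have

$$\|Y - X\hat\beta\|_2^2 + \lambda\|\hat\beta\|_1 \le \|Y - X\beta_0\|_2^2 + \lambda\|\beta_0\|_1.$$

This implies, since $Y = X\beta_0 + \sigma\epsilon$, that

$$\|X(\hat\beta - \beta_0)\|_2^2 \le 2\sigma|\epsilon^T X(\hat\beta - \beta_0)| + \lambda\|\beta_0\|_1 - \lambda\|\hat\beta\|_1.$$

Next, on the set $T$, we have $2\sigma|\epsilon^T X(\hat\beta - \beta_0)| \le \lambda\|\hat\beta - \beta_0\|_1$. Additionally, by the definition of $J_0$,

$$\|\hat\beta - \beta_0\|_1 = \|(\hat\beta - \beta_0)_{J_0}\|_1 + \|\hat\beta_{J_0^c}\|_1.$$

Combining these two arguments and using the triangle inequality, we get

$$\|X(\hat\beta - \beta_0)\|_2^2 \le \lambda\|(\hat\beta - \beta_0)_{J_0}\|_1 + \lambda\|\hat\beta_{J_0^c}\|_1 + \lambda\|\beta_0\|_1 - \lambda\|\hat\beta\|_1 \le 2\lambda\|(\hat\beta - \beta_0)_{J_0}\|_1.$$

On the other hand, using the triangle inequality, we also get

$$\|X(\hat\beta - \beta_0)\|_2^2 \le \lambda\|\hat\beta - \beta_0\|_1 + \lambda\|\beta_0\|_1 - \lambda\|\hat\beta\|_1 \le 2\lambda\|\beta_0\|_1.$$

This completes the proof. □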

Proof of Theorem 3.2. We first show that for a fixed $x \in \mathbb{R}_0^+$, the parameter space $\{\beta \in \mathbb{R}^p : \|\beta\|_1 \le M\}$ can be replaced by the $K(x)$ dimensional parameter space $\{\tilde\beta \in \mathbb{R}^{K(x)} : \|\tilde\beta\|_1 \le (1 + x)M\}$. Then, we bound the stochastic term in expectation and probability and eventually take the infimum over $x \in \mathbb{R}_0^+$ to derive the desired inequalities.

As a start, we assume without loss of generality $\sigma = 1$, we fix $x \in \mathbb{R}_0^+$ and set $K := K(x)$. Then, according to the definition of the correlation function (5), there exist vectors $x^{(1)}, \dots, x^{(K)} \in \sqrt{n}S^{n-1}$ and numbers $\{\kappa_j(m) : 1 \le j \le K\}$ for all $1 \le m \le p$ such that $X^{(m)} = \sum_{j=1}^K \kappa_j(m)x^{(j)}$ and $\sum_{j=1}^K |\kappa_j(m)| \le (1 + x)$. Thus,

$$(X\beta)_i = \sum_{m=1}^p X_i^{(m)}\beta_m = \sum_{m=1}^p \sum_{j=1}^K \kappa_j(m)\,x_i^{(j)}\beta_m = \sum_{j=1}^K x_i^{(j)} \sum_{m=1}^p \kappa_j(m)\beta_m$$

and additionally

$$\sum_{j=1}^K \Big| \sum_{m=1}^p \kappa_j(m)\beta_m \Big| \le \sum_{m=1}^p \sum_{j=1}^K |\kappa_j(m)|\,|\beta_m| \le (1 + x)\|\beta\|_1.$$

These two results imply

$$\sup_{\|\beta\|_1 \le M} |\epsilon^T X\beta| \le \sup_{\|\tilde\beta\|_1 \le (1+x)M} |\epsilon^T \tilde{X}\tilde\beta|,$$

where $\tilde\beta \in \mathbb{R}^K$ and $\tilde{X} := (x^{(1)}, \dots, x^{(K)})$. That is, we can replace the $p$ dimensional parameter space by a $K$ dimensional parameter space at the price of an additional factor $1 + x$.
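We now bound the stochastic term in expectation. First, we obtain by Hölder's inequality

$$\mathbb{E}\Big[ \sup_{\|\tilde\beta\|_1 \le (1+x)M} |\epsilon^T \tilde{X}\tilde\beta| \Big] = \mathbb{E}\Big[ \sup_{\|\tilde\beta\|_1 \le (1+x)M} \Big| \sum_{i=1}^n \sum_{j=1}^K \epsilon_i x_i^{(j)} \tilde\beta_j \Big| \Big] \le \mathbb{E}\Big[ \sup_{\|\tilde\beta\|_1 \le (1+x)M} \|\tilde\beta\|_1 \max_{1 \le j \le K} |\epsilon^T x^{(j)}| \Big] = (1 + x)M\, \mathbb{E}\Big[ \max_{1 \le j \le K} |\epsilon^T x^{(j)}| \Big].$$

Next (cf. the proof of [vdGL11, Lemma 3]), we obtain for $\Psi(x) := e^{x^2} - 1$

$$\mathbb{E}\Big[ \max_{1 \le j \le K} |\epsilon^T x^{(j)}| \Big] \le \Psi^{-1}(K) \max_{1 \le j \le K} \|\epsilon^T x^{(j)}\|_\Psi,$$

where $\|\cdot\|_\Psi$ denotes the Orlicz norm with respect to the function $\Psi$ (see [vdGL11] for a definition). Since $\epsilon^T x^{(j)} / \sqrt{n}$ is standard normally distributed, we obtain $\|\epsilon^T x^{(j)} / \sqrt{n}\|_\Psi = \sqrt{8/3}$ (see for example [vdVW00, Page 100]). Moreover, one may check that $\Psi^{-1}(y) = \sqrt{\log(1 + y)}$. Consequently,

$$\mathbb{E}\Big[ \sup_{\|\tilde\beta\|_1 \le (1+x)M} |\epsilon^T \tilde{X}\tilde\beta| \Big] \le (1 + x)M\sqrt{8n\log(1 + K)}.$$

One can then derive the second assertion of the theorem by taking the infimum over $x \in \mathbb{R}_0^+$.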

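As a next step, we deduce similarly as above (compare also to [BRT09])

$$\mathbb{P}\Big(\sup_{\|\beta\|_1 \le M} 2|\epsilon^T X\beta| > \lambda M\Big)
\le \mathbb{P}\Big(\sup_{\|\tilde\beta\|_1 \le 1} 2|\epsilon^T \tilde X \tilde\beta| > \frac{\lambda}{1+x}\Big)
\le K \max_{1 \le j \le K} \mathbb{P}\Big(|\epsilon^T \tilde X^{(j)}| > \frac{\lambda}{2(1+x)}\Big)
= K\, \mathbb{P}\Big(|\gamma| > \frac{\lambda}{2(1+x)\sqrt{n}}\Big)
\le 2K \exp\Big(-\frac{\lambda^2}{8(1+x)^2 n}\Big),$$

where $\gamma$ is a standard normally distributed random variable. Setting
$\lambda := 2(1+x)\sqrt{2n \log(2K/\kappa)}$, we obtain

$$\mathbb{P}\Big(\sup_{\|\beta\|_1 \le M} 2|\epsilon^T X\beta| > \lambda M\Big) \le 2K \exp\big(-\log(2K/\kappa)\big) = \kappa.$$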

The first assertion can finally be derived by taking the infimum over $x \in \mathbb{R}_0^+$ and applying
monotone convergence. □
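
The choice of the tuning parameter in the proof above can be verified by direct computation. The sketch below evaluates $\lambda = 2(1+x)\sqrt{2n\log(2K/\kappa)}$ and plugs it into the tail bound $2K\exp(-\lambda^2/(8(1+x)^2 n))$. The function names and the numerical values of $n$, $K$, $x$, $\kappa$ are hypothetical.

```python
import math

def tuning_parameter(n, K, x, kappa):
    """lambda = 2 (1 + x) sqrt(2 n log(2 K / kappa))."""
    return 2.0 * (1.0 + x) * math.sqrt(2.0 * n * math.log(2.0 * K / kappa))

def tail_bound(lam, n, K, x):
    """Union plus Gaussian tail bound 2 K exp(-lambda^2 / (8 (1 + x)^2 n))."""
    return 2.0 * K * math.exp(-lam ** 2 / (8.0 * (1.0 + x) ** 2 * n))

n, K, x, kappa = 100, 12, 0.25, 0.05   # illustrative values only
lam = tuning_parameter(n, K, x, kappa)
print(lam, tail_bound(lam, n, K, x))
```

With this $\lambda$ the tail bound equals $\kappa$ up to rounding, and $\lambda$ grows only logarithmically in $K$.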

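Proposition 1. It holds for all $0 < \delta \le 1$

$$\Big(\frac{1}{\delta}\Big)^n \le N\big(\delta, \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}, \|\cdot\|_2\big) \le \Big(\frac{3}{\delta}\Big)^n.$$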

Proof. If a collection of balls of radius $\delta$ covers the set $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$, the volume covered
by these balls is at least as large as the volume of $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$. Thus,

$$N\big(\delta, \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}, \|\cdot\|_2\big) \cdot \delta^n \ge 1,$$

and the left inequality follows. The right inequality can be deduced similarly. □
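
For intuition, the covering bounds can be checked exactly in dimension $n = 1$, where the unit ball is the interval $[-1, 1]$ and a ball of radius $\delta$ is an interval of length $2\delta$. The sketch below is an illustration under two assumptions: the upper bound is taken to be the standard $(3/\delta)^n$, and the minimal cover of $[-1,1]$ is computed as $\lceil 1/\delta \rceil$. Exact rational arithmetic avoids rounding issues.

```python
import math
from fractions import Fraction

def covering_number_interval(delta):
    """Minimal number of closed intervals of radius delta covering [-1, 1]."""
    return math.ceil(1 / delta)

def proposition_bounds(delta, n=1):
    """Lower bound (1/delta)^n and assumed upper bound (3/delta)^n."""
    return (1 / delta) ** n, (3 / delta) ** n

deltas = [Fraction(1), Fraction(1, 2), Fraction(3, 10), Fraction(1, 7), Fraction(1, 100)]
checks = []
for delta in deltas:
    lower, upper = proposition_bounds(delta)
    checks.append(lower <= covering_number_interval(delta) <= upper)
print(all(checks))
```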

Proof Sketch for Example 3.3. For weakly correlated designs and for $p$ very large, it
holds that

$$\log\big(1 + N(\sqrt{n}\,\delta, \operatorname{sconv}\{X^{(1)}, \dots, X^{(p)}\}, \|\cdot\|_2)\big)
\approx \log\big(N(\delta, \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}, \|\cdot\|_2)\big).$$

We can then apply Proposition 1 to deduce

$$\log\big(1 + N(\sqrt{n}\,\delta, \operatorname{sconv}\{X^{(1)}, \dots, X^{(p)}\}, \|\cdot\|_2)\big) \approx n \log(1/\delta).$$

Hence, Lemma 3.2 requires that $A$ fulfills for all $0 < \delta \le 1$ (in fact, a restricted range of
$\delta$ is sufficient, see the proofs in [Led10] for this refinement)

$$\Big(\frac{A}{\delta}\Big)^{2\alpha} \approx n \log(1/\delta)$$

and thus

$$A^{\alpha} \approx \delta^{\alpha} \sqrt{n \log(1/\delta)}.$$

We can maximize the right hand side over $0 < \delta \le 1$ and plug the result into
[vdGL12, Corollary 5.2] to deduce the result. □
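
The maximization in the last step can be made explicit. Setting the derivative of $\delta \mapsto \delta^{\alpha}\sqrt{n\log(1/\delta)}$ to zero gives $\log(1/\delta) = 1/(2\alpha)$, hence the maximizer $\delta^* = e^{-1/(2\alpha)}$ and the maximal value $\sqrt{n/(2\alpha e)}$. The sketch below compares this closed form with a grid search. Since the relations above hold only up to constants, the result should be read as an order of magnitude. The function names and the values of $\alpha$ and $n$ are hypothetical.

```python
import math

def right_hand_side(delta, alpha, n):
    """delta^alpha * sqrt(n log(1 / delta)) for 0 < delta < 1."""
    return delta ** alpha * math.sqrt(n * math.log(1.0 / delta))

def maximizing_delta(alpha):
    """Stationary point: log(1 / delta) = 1 / (2 alpha)."""
    return math.exp(-1.0 / (2.0 * alpha))

def maximal_value(alpha, n):
    """Value of the right hand side at the maximizer: sqrt(n / (2 alpha e))."""
    return math.sqrt(n / (2.0 * alpha * math.e))

alpha, n = 0.5, 100                 # illustrative values only
grid = [k / 10000.0 for k in range(1, 10000)]
grid_max = max(right_hand_side(d, alpha, n) for d in grid)
closed_form = maximal_value(alpha, n)
print(grid_max, closed_form)
```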
Proof for Example 3.4. The realizations of the random vectors $X^{(1)} + \nu N$ are clustered
with high probability and can thus be described (in the sense of the definition
of $K(x)$) by far fewer than $p$ vectors.

To see this, consider a rotation $R \in SO(n)$ such that $RX^{(1)} = (\sqrt{n}, 0, \dots, 0)^T$.
Moreover, let $N$ be a standard normally distributed random vector in $\mathbb{R}^n$, $n \ge 2$.
The measure corresponding to the random vector $N$ is denoted by $\mathbb{P}$. We then obtain
with the triangle inequality and the condition on $\nu$



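$$\mathbb{P}\Big(\frac{\sqrt{n}}{\|RX^{(1)} + \nu RN\|_2} > 2\Big)
= \mathbb{P}\Big(\frac{\|RX^{(1)} + \nu RN\|_2}{\sqrt{n}} < \frac{1}{2}\Big)
\le \mathbb{P}\Big(\frac{\|X^{(1)}\|_2 - \nu\|N\|_2}{\sqrt{n}} < \frac{1}{2}\Big)
\le \mathbb{P}\Big(\|N\|_2 > \frac{\sqrt{n}}{2\nu}\Big)
\le \mathbb{P}\big(\|N\|_2 > \sqrt{2n}\big).$$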



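Now, we can bound the $\ell_1$ distance of the vector $\sqrt{n}\,(RX^{(1)} + \nu RN) / \|RX^{(1)} + \nu RN\|_2$ generated according
to our setting to the vector $RX^{(1)}$:

$$\mathbb{P}\Big(\Big\|RX^{(1)} - \frac{\sqrt{n}\,(RX^{(1)} + \nu RN)}{\|RX^{(1)} + \nu RN\|_2}\Big\|_1 > 2\sqrt{n}\Big)
\le \mathbb{P}\Big(\sqrt{n}\,\Big|1 - \frac{\sqrt{n}}{\|RX^{(1)} + \nu RN\|_2}\Big| + \frac{\sqrt{n}\,\|\nu RN\|_1}{\|RX^{(1)} + \nu RN\|_2} > 2\sqrt{n}\Big)$$
$$\le \mathbb{P}\Big(\sqrt{n}\,\Big|1 - \frac{\sqrt{n}}{\|RX^{(1)} + \nu RN\|_2}\Big| > \sqrt{n}\Big) + \mathbb{P}\big(\|\nu RN\|_1 > \|RX^{(1)} + \nu RN\|_2\big)$$
$$\le \mathbb{P}\Big(\Big|1 - \frac{\sqrt{n}}{\|RX^{(1)} + \nu RN\|_2}\Big| > 1\Big) + \mathbb{P}\big(\nu\sqrt{n}\,\|N\|_2 > \sqrt{n} - \nu\|N\|_2\big)$$
$$\le \mathbb{P}\big(\|N\|_2 > \sqrt{2n}\big) + \mathbb{P}\Big(\|N\|_2 > \frac{\sqrt{n}}{\nu(1 + \sqrt{n})}\Big)
\le 2\,\mathbb{P}\big(\|N\|_2 > \sqrt{2n}\big).$$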



It can be derived easily from standard results (see for example [BvdG11, Page 254])
that

$$\mathbb{P}\big(\|N\|_2 > \sqrt{2n}\big) \le \exp\Big(-\frac{n}{8}\Big).$$
Consequently,

$$\mathbb{P}\Big(\Big\|RX^{(1)} - \frac{\sqrt{n}\,(RX^{(1)} + \nu RN)}{\|RX^{(1)} + \nu RN\|_2}\Big\|_1 > 2\sqrt{n}\Big) \le 2\exp\Big(-\frac{n}{8}\Big). \qquad (6)$$

Finally, define for any $0 < z \le 2\sqrt{n}$ the set

$$C_z := \{(\sqrt{n}, z, 0, \dots, 0)^T, (\sqrt{n}, -z, 0, \dots, 0)^T, (\sqrt{n}, 0, z, 0, \dots, 0)^T, (\sqrt{n}, 0, -z, 0, \dots, 0)^T, \dots\} \subset \mathbb{R}^n.$$

These vectors will play the role of the vectors $x^{(1)}, \dots, x^{(K)}$ in Definition (3). It holds
that

$$\operatorname{card}(C_z) = 2(n-1) \qquad (7)$$

and

$$C_z \subset \sqrt{n + z^2}\, S^{n-1}. \qquad (8)$$

Moreover,

$$\{x \in \mathbb{R}^n : \|x - RX^{(1)}\|_1 \le z\} \cap \sqrt{n}\, S^{n-1} \subset \operatorname{sconv}(C_z). \qquad (9)$$

Inequality (6) and Inclusion (9) imply that with probability at least $1 - 2\exp(-n/8)$, for $z = 2\sqrt{n}$,

$$\frac{\sqrt{n}\,(X^{(1)} + \nu N)}{\|RX^{(1)} + \nu RN\|_2} \in \operatorname{sconv}(R^T C_z).$$

Thus, using Equality (7) and Inclusion (8), we have $K(\sqrt{5} - 1) \le 2(n-1)$ with
probability at least $1 - 2(p-1)\exp(-n/8)$. □
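
The construction of $C_z$ can be illustrated numerically. The sketch below builds $C_z$ for $z = 2\sqrt{n}$, checks its cardinality $2(n-1)$ and the common norm $\sqrt{n + z^2} = \sqrt{5n}$ of its elements (which is where the factor $1 + x = \sqrt{5}$ comes from), and simulates the $\ell_1$ distance between $RX^{(1)} = (\sqrt{n}, 0, \dots, 0)^T$ and the normalized perturbed vectors. The rotation is taken to be the identity, and the value of $\nu$ is a hypothetical choice, since the exact condition on $\nu$ is not restated here. All names are assumptions for illustration.

```python
import math
import random

def build_C(n, z):
    """The vectors (sqrt(n), 0, ..., +z or -z, ..., 0)^T, z in coordinates 2..n."""
    vectors = []
    for k in range(1, n):
        for sign in (1.0, -1.0):
            v = [0.0] * n
            v[0] = math.sqrt(n)
            v[k] = sign * z
            vectors.append(v)
    return vectors

def l1_distance_to_center(n, nu, rng):
    """l1 distance between x1 = (sqrt(n), 0, ..., 0) and
    sqrt(n) (x1 + nu N) / ||x1 + nu N||_2 for standard normal N."""
    x1 = [math.sqrt(n)] + [0.0] * (n - 1)
    perturbed = [a + nu * rng.gauss(0.0, 1.0) for a in x1]
    norm = math.sqrt(sum(a * a for a in perturbed))
    return sum(abs(a - math.sqrt(n) * b / norm) for a, b in zip(x1, perturbed))

n = 20                              # illustrative value only
z = 2.0 * math.sqrt(n)
C = build_C(n, z)
norms = [math.sqrt(sum(a * a for a in v)) for v in C]

rng = random.Random(1)
nu = 0.05                           # hypothetical noise scale
distances = [l1_distance_to_center(n, nu, rng) for _ in range(1000)]
print(len(C), norms[0] / math.sqrt(n), max(distances) <= z)
```

In this simulation the perturbed vectors stay well inside the $\ell_1$ ball of radius $z$ around $RX^{(1)}$, so $2(n-1)$ vectors suffice to describe them regardless of how many columns are generated.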



[BC11] A. Belloni and V. Chernozhukov. High dimensional sparse econometric
models: An introduction. In P. Alquier, E. Gautier, and G. Stoltz,
editors, Inverse Problems and High-Dimensional Estimation. Springer
(Lecture Notes in Statistics), 2011.

[BJMO11] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization
with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright,
editors, Optimization for Machine Learning. MIT Press, 2011.

[BRT09] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and
Dantzig selector. Ann. Statist., 37(4):1705-1732, 2009.






[BTW07a] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian
regression. Ann. Statist., 35(4):1674-1697, 2007.

[BTW07b] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities
for the Lasso. Electron. J. Stat., 1:169-194, 2007.

[BvdG11] P. Bühlmann and S. van de Geer. Statistics for High Dimensional Data.
Methods, Theory and Applications. Springer, 2011.

[CDS98] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis
pursuit. SIAM J. Sci. Comput., 20(1):33-61, 1998.

[CP09] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization.
Ann. Statist., 37(5A), 2009.

[DDDM04] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding
algorithm for linear inverse problems with a sparsity constraint. Comm.
Pure Appl. Math., 57(11):1413-1457, 2004.

[EHJT04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle
regression. Ann. Statist., 32(2):407-499, 2004. With discussion, and a
rejoinder by the authors.

[FHHT07] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise
coordinate optimization. Ann. Appl. Stat., 1(2):302-332, 2007.

[Fu98] W. Fu. Penalized regressions: the bridge versus the lasso. J. Comput.
Graph. Statist., 7(3):397-416, 1998.

[HCB08] C. Huang, G. Cheang, and A. Barron. Risk of penalized least squares,
greedy selection and L1 penalization for flexible function libraries.
Submitted to Ann. Statist., 2008.

[HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical
learning. Springer Series in Statistics. Springer-Verlag, New York, 2001.
Data mining, inference, and prediction.

[KTL11] V. Koltchinskii, A. Tsybakov, and K. Lounici. Nuclear norm penalization
and optimal rates for noisy low rank matrix completion. Ann. Statist.,
39(5):2302-2329, 2011.

[Led10] J. Lederer. Bounds for Rademacher processes via chaining. Technical
Report, ETH Zurich, 2010.






[MM11] P. Massart and C. Meynet. The Lasso as an $\ell_1$-ball model selection
procedure. Electron. J. Stat., 5:669-687, 2011.

[OPT00] M. Osborne, B. Presnell, and B. Turlach. On the LASSO and its dual.
J. Comput. Graph. Statist., 9(2):319-337, 2000.

[RT11] P. Rigollet and A. Tsybakov. Exponential Screening and optimal rates
of sparse estimation. Ann. Statist., 39(2):731-771, 2011.

[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B, 58(1):267-288, 1996.

[Tur05] B. Turlach. On algorithms for solving least squares problems under
an $\ell_1$ penalty or an $\ell_1$ constraint. In 2004 Proceedings of the American
Statistical Association, Statistical Computing Section [CD-ROM], 2572-2577.
Alexandria, VA, 2005.

[vdGB09] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle
results for the lasso. Electron. J. Stat., 3:1360-1392, 2009.

[vdGL11] S. van de Geer and J. Lederer. The Bernstein-Orlicz norm and deviation
inequalities. Preprint, 2011.

[vdGL12] S. van de Geer and J. Lederer. The lasso, correlated design, and improved
oracle inequalities. Ann. Appl. Stat., 2012. To appear.

[vdVW00] A. van der Vaart and J. Wellner. Weak Convergence and Empirical
Processes: With Applications to Statistics. Springer, 2000. ISBN 0-387-94640-3.






